[KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors #5141
zhouyifan279 wants to merge 5 commits into apache:master
Conversation
```diff
-      adjustTargetNumExecutors = true,
+      adjustTargetNumExecutors = false,
       countFailures = false,
       force = false)
```
Shall we call client.requestTotalExecutors to adjust the target executors if targetExecutors < draTargetExecutors after killing executors?
We might face an issue similar to apache/spark#19048.
IMO, we just need to change adjustTargetNumExecutors from true to false and call client.requestTotalExecutors to ensure the final target executor number is as expected.
Agree. But building the parameters of client.requestTotalExecutors involves many details,
so I prefer to wait for ExecutorAllocationManager to call client.requestTotalExecutors itself.
As long as we do not call killExecutors with adjustTargetNumExecutors = true, ExecutorAllocationManager should be able to manage the target executor number correctly.
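For context, a minimal sketch of the resulting call: the parameter names match the diff above and follow the killExecutors signature on Spark 3.x's ExecutorAllocationClient, while executorsToKill and executorAllocationClient are assumed placeholders for the values FinalStageResourceManager works with.

```scala
// Kill the surplus executors before the final write stage, but leave the
// target executor number untouched so ExecutorAllocationManager can still
// raise it later through client.requestTotalExecutors.
val killed: Seq[String] = executorAllocationClient.killExecutors(
  executorIds = executorsToKill,     // assumed: ids chosen by FinalStageResourceManager
  adjustTargetNumExecutors = false,  // the fix: do not pin the target to 0
  countFailures = false,
  force = false)
```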
```scala
if (draTargetExecutors <= targetExecutors) {
  // Ensure target executor number has been updated in cluster manager client
  executorAllocationClient.requestExecutors(0)
}
```
I'm not sure I get it correctly. It seems we should do nothing if the DRA target executor number is smaller than our target executor number. If that happens, the executors are pending to be killed.
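A hedged note on why the hunk above can use requestExecutors(0) as a sync: in Spark's CoarseGrainedSchedulerBackend, requestExecutors adds the delta to the stored desired total and then re-sends the whole total to the cluster manager, so a delta of 0 re-syncs the target without asking for anything new. A simplified, self-contained model of that behavior (not the actual Spark source; names abbreviated for illustration):

```scala
// Simplified model of Spark's CoarseGrainedSchedulerBackend.requestExecutors.
object SchedulerBackendModel {
  private var requestedTotalExecutors = 0

  // Stand-in for the push of the desired total to the cluster manager
  // (e.g. YarnAllocator in YARN mode).
  private def doRequestTotalExecutors(total: Int): Boolean = true

  // A delta of 0 leaves the total unchanged but still re-sends it, which is
  // exactly the "sync" the hunk above relies on.
  def requestExecutors(numAdditionalExecutors: Int): Boolean = {
    requestedTotalExecutors += numAdditionalExecutors
    doRequestTotalExecutors(requestedTotalExecutors)
  }
}
```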
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #5141     +/-   ##
=========================================
  Coverage    0.00%     0.00%
=========================================
  Files         566       566
  Lines       31590     31654      +64
  Branches     4120      4124       +4
=========================================
- Misses      31590     31654      +64
```

see 20 files with indirect coverage changes
zhouyifan279 force-pushed the branch from 92aaa08 to 9dcbc78 ([KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors)
```diff
     spark: '3.3'
     spark-archive: '-Dspark.archive.mirror=https://archive.apache.org/dist/spark/spark-3.1.3 -Dspark.archive.name=spark-3.1.3-bin-hadoop3.2.tgz -Pzookeeper-3.6'
-    exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest'
+    exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest,org.apache.kyuubi.tags.SparkLocalClusterTest'
```
A task serialized by the Spark 3.3 driver cannot be deserialized by a Spark 3.1.3 executor, so SparkLocalClusterTest has to be excluded in this cross-version job.
cc @ulysses-you @pan3793 Ready for review.
```scala
eventually(timeout(Span(10, Minutes))) {
  sql(
    "CREATE TABLE final_stage AS SELECT id, count(*) as num FROM (SELECT 0 id) GROUP BY id")
}
```
Shall we call getAdjustedTargetExecutors at the end to make sure the target executor number is 1?
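A minimal sketch of what that extra assertion could look like; getAdjustedTargetExecutors is the helper named in the comment above, while its exact signature and return value are assumptions here:

```scala
// Hypothetical follow-up check: after FinalStageResourceManager has killed the
// surplus executors, the adjusted target tracked for the final stage should
// converge to a single executor.
eventually(timeout(Span(10, Minutes))) {
  assert(getAdjustedTargetExecutors() === 1)
}
```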
Thanks, merged to master
Why are the changes needed?
In rare cases, a Spark stage hangs forever when spark.sql.finalWriteStage.eagerlyKillExecutors.enabled is true.
The bug occurs if two conditions are met at the same time:
1. FinalStageResourceManager kills all executors with adjustTargetNumExecutors = true. The target executor number in YarnAllocator is then set to 0, and no more executors will be launched.
2. The target computed by ExecutorAllocationManager does not change afterwards, so ExecutorAllocationManager will not sync the target executor number to YarnAllocator (see the sketch after this list).
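To make the hang mechanism concrete, here is a heavily simplified model of the sync-on-change behavior in Spark's ExecutorAllocationManager.updateAndSyncNumExecutorsTarget; all names and the call shape below are illustrative paraphrases, not the actual source:

```scala
// Simplified model: ExecutorAllocationManager only pushes its target to the
// cluster manager when that target changes.
class AllocationManagerModel(syncTargetToClusterManager: Int => Unit) {
  private var numExecutorsTarget = 1

  def updateAndSyncNumExecutorsTarget(maxNeeded: Int): Unit = {
    val newTarget = math.max(maxNeeded, 1)
    if (newTarget != numExecutorsTarget) {
      numExecutorsTarget = newTarget
      syncTargetToClusterManager(newTarget) // stands in for requestTotalExecutors
    }
    // If the target did not change, nothing is sent: YarnAllocator keeps the 0
    // that killExecutors(adjustTargetNumExecutors = true) pinned earlier, and
    // the final stage waits forever for an executor that will never launch.
  }
}
```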
How was this patch tested?
FinalStageResourceManagerSuite